Our group is interested in housing selection and security issues in New York city. For this project, we plan to use NYPD public safety data and New York city zip code data as major data sources, combining variables that represent total occupied housing units and owner-occupied units to create a new variable that measures the percentage of owner-occupied units per NYC zip code.
#Load the data
nypd_precinct <- read.csv("NYPD_Complaint_Data_Current__Year_To_Date_.csv")
nypd<-read.csv("NYPD_Complaint_Data_Current__Year_To_Date - filtered.csv")
nypd$CMPLNT_FR_DT <- mdy(nypd$CMPLNT_FR_DT)
nypd$CMPLNT_FR_TM <- format(as.POSIXct(nypd$CMPLNT_FR_TM, format = "%H:%M:%S"), "%H")
nypd$CMPLNT_FR_YR <- format(as.POSIXct(nypd$CMPLNT_FR_DT, format = "%Y-%M-%D"), "%Y")
nypd$CMPLNT_FR_YRMT<-format(nypd$CMPLNT_FR_DT, "%y-%m")
nypd<-nypd%>%
filter(CMPLNT_FR_YR <= 2021)
plot_data<-nypd %>%
filter(CRM_ATPT_CPTD_CD == 'COMPLETED')%>%
filter(VIC_SEX == "F"|VIC_SEX=="M")%>%
select(CMPLNT_FR_TM,VIC_SEX,CMPLNT_NUM) %>%
group_by(CMPLNT_FR_TM,VIC_SEX) %>%
count(CMPLNT_NUM)%>%
group_by(CMPLNT_FR_TM,VIC_SEX) %>%
summarise(Victum_num = sum(n))%>%
arrange(VIC_SEX,CMPLNT_FR_TM)
plot1<-plot_data %>%
ggplot(.,aes(CMPLNT_FR_TM,Victum_num))+
theme_bw()+
geom_point(aes(color=VIC_SEX))+
scale_shape_discrete(guide=FALSE)+
labs(x="Hours", y="Total number of victims", title="Total number of victims across 24 hours")+
theme(plot.title=element_text(hjust=0.5))
interactiveplot1<-ggplotly(plot1) %>%
layout(legend=list(orientation="h", x=0.2, y=-0.2), hovermode="x")
interactiveplot1
We used an interactive point chart to demonstrate the total number of victims over one day (0am - 23pm) and the different color of the dot represents each gender. The curve showed an upward trend from morning to afternoon time. The total number of victims reached peak during 12pm, especially for female victims. Different from what we commonly think, night time from 23 pm to 4pm the number of crimes diminished quickly and reached bottom at 5am. Overall, the number of female crimes is much higher than male.
nypd$day_by_day_in_a_week<- wday(nypd$CMPLNT_FR_DT, label=TRUE)
return_by_hour <- function(x) {
return (as.numeric(strsplit(x,":")[[1]][1]))
}
nypd_by_hour <- nypd %>%
mutate(Hour = sapply(CMPLNT_FR_TM, return_by_hour)) %>%
group_by(day_by_day_in_a_week, Hour) %>%
summarize(count = n())
nypd_by_hour$day_by_day_in_a_week <- factor(nypd_by_hour$day_by_day_in_a_week, level = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"))
nypd_by_hour$Hour <- factor(nypd_by_hour$Hour, level = 0:23, label = c(0:23))
nypd_by_hour %>%
ggplot(aes(x = Hour, y = day_by_day_in_a_week, fill = count)) + geom_tile() +
scale_fill_continuous(trans = 'reverse') + ggtitle("Number of Crime reported by Type")
We used ggplot to create a heat map in order to more clearly view the crime that happened during one week in New York. The shade of blue represents the number of crimes, the darker the color means the higher the crime. It can be easily found that crime normally happens during noon time, then from 15 - 19 pm during the day. Besides, midnight zero clock also has a high incident hour.
plot_data2<-nypd %>%
filter(CRM_ATPT_CPTD_CD == 'COMPLETED')%>%
filter(SUSP_RACE !="" )%>%
filter(CMPLNT_FR_YR != 2021&CMPLNT_FR_YR != 2020 & CMPLNT_FR_YR>=1969)%>%
mutate(SUSP_RACE=case_when(SUSP_RACE=="BLACK"~"BLACK",
SUSP_RACE=="BLACK HISPANIC"~"BLACK",
SUSP_RACE=="WHITE"~"WHITE",
SUSP_RACE=="WHITE HISPANIC"~"WHITE",
TRUE ~ "OTHER"))%>%
select(CMPLNT_FR_YR,SUSP_RACE,CMPLNT_NUM) %>%
group_by(CMPLNT_FR_YR,SUSP_RACE) %>%
count(CMPLNT_NUM)%>%
group_by(CMPLNT_FR_YR,SUSP_RACE) %>%
summarise(SUSP_num = sum(n))%>%
arrange(CMPLNT_FR_YR,SUSP_RACE)
fig <- plot_ly(plot_data2, x = ~CMPLNT_FR_YR, y = ~SUSP_num, type = 'scatter', mode = '', color = ~SUSP_RACE)
fig <- fig%>%layout(title = 'Total number of suspects each year',
xaxis = list(title = 'Year'),
yaxis = list (title = 'Total number of suspects'))
fig
Then, we used ggplot to explore that kind of race with a high crime suspect possibility. Other race is the most significant race of suspects followed by white and black population. There used to be a similar level from 2019 to 1996. However, the number of other race crimes jumped sharply from less than 50 to over 300 during 2006 to 2019.
plot_data3<-nypd%>%
filter(nypd$CMPLNT_FR_YR == 2021)%>%
filter(BORO_NM!="")%>%
select(CMPLNT_FR_YRMT,BORO_NM,CMPLNT_NUM) %>%
group_by(CMPLNT_FR_YRMT,BORO_NM) %>%
count(CMPLNT_NUM)%>%
group_by(CMPLNT_FR_YRMT,BORO_NM) %>%
summarise(CMPLNT_num = sum(n))%>%
arrange(CMPLNT_FR_YRMT,BORO_NM)
plot3<-plot_ly(plot_data3,x = ~CMPLNT_num, y = ~reorder(BORO_NM, (CMPLNT_num)), type = 'bar',
name = ~CMPLNT_FR_YRMT, color = ~CMPLNT_FR_YRMT) %>%
layout(yaxis = list(title = 'Count'), barmode = 'stack')
plot3 <- plot3%>%layout(title = 'Total number of crimes of each borough in 2021',
xaxis = list(title = 'Name of Borough'),
yaxis = list (title = 'Total number of crimes'))
plot3
Furthermore, we added more factors in ggplot and selected 2021 as target year to explore the total crime data in New York. Brooklyn still has the highest crime rate at more than 120k and distributes quite evenly every month. Manhattan and queens rank second and third place, with around 110k and 90k respectively. More specifically, from October to December, the crime cases increase quicker than other months in 2021.
new<-nypd %>%
filter(BORO_NM !="" )%>%
group_by(BORO_NM, LAW_CAT_CD) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(BORO_NM, -(count)), fill=LAW_CAT_CD)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Number of Crime by Borough and its Crime Type") +
xlab("Percent") + ylab("Name of Borough")
new
This stacked bar chart demonstrates the total number of crimes by borough with its crime type. Brooklyn still has the largest crime case number. Most crime types are misdemeanors, which are represented by green color.
data<-nypd_precinct%>% filter(CRM_ATPT_CPTD_CD !="" )
data <- data %>% group_by(ADDR_PCT_CD)%>%
count(CRM_ATPT_CPTD_CD)
data1 <-data[which(data$CRM_ATPT_CPTD_CD == "ATTEMPTED"),]
data_attempted <-select(data1, c('ADDR_PCT_CD','n'))
names(data_attempted)[1] <- 'police_precinct'
names(data_attempted)[2] <-'attempted'
data2 <-data[which(data$CRM_ATPT_CPTD_CD == "COMPLETED"),]
data_completed <-select(data2, c('ADDR_PCT_CD','n'))
names(data_completed)[1] <- 'police_precinct'
names(data_completed)[2] <-'completed'
data_total <- merge(x=data_attempted, y=data_completed, by= "police_precinct")
data_total$total_crime <- data_total$attempted + data_total$completed
datajson <- geojsonio::geojson_read("https://data.beta.nyc/dataset/5ed20732-5cf9-4812-b8ac-70ad4d10a1ca/resource/375dcf37-5cd9-4c74-9c53-c638b6bb62d0/download/742720184001424d85664732f950040apoliceprecincts.geojson", what = "sp")
bins <- c(0, 2000, 4000, 6000, 8000, 10000, 12000, Inf)
pal <- colorBin("YlOrRd", domain = data_total$total_crime, bins = bins)
leaflet(datajson) %>%
addProviderTiles("MapBox", options = providerTileOptions(
id = "mapbox.light",
accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN')))%>%
addProviderTiles("Stamen.TonerLite") %>%
addPolygons(
fillColor = ~pal(data_total$total_crime),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7,
highlightOptions = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE),
label=paste('Number of Total Crime:', data_total$total_crime,
'; Number of Completed Crime:',data_total$completed,
'; Number of Attempted Crime:',data_total$attempted,
'; Police Precinct:',data_total$police_precinct)
)
We used a leaflet to draw a geo-graph which contains demographic variables in the Census zip code data that may relate to the crime rate in New York City. The graph illustrates that the crime rates are really different in various districts. The darker the color, the higher crime rates. The areas with high proportions are distributed by Brooklyn and Manhattan, especially in the east side of Brooklyn and west upper side of Manhattan.
nypd <- nypd %>% slice_sample(n = 100)
pal = colorFactor("Set1", domain = nypd$LAW_CAT_CD)
color = pal(nypd$LAW_CAT_CD)
popup_info <- paste("Status of Crime:",nypd$CRM_ATPT_CPTD_CD,
"Specific location of occurrence:",nypd$LOC_OF_OCCUR_DESC,
"Patrol Borough:",nypd$PATROL_BORO)
leaflet(nypd) %>%
addProviderTiles("Stamen.TonerLite") %>% #<<
addCircles(col=color,popup = popup_info)%>%
addLegend(pal = pal,values = nypd$LAW_CAT_CD, title = "Level of Offense")
Lastly, we also used graphs to demonstrate the level of offense distributed in New York. Red means Felony. Misdemeanor is a blue spot. Violation is the green dot. The blue color, which is a misdemeanor, scatters all over the New York City area. It has the highest happen rate which is concentrated in the Brooklyn and Manhattan area. Then there is the felony which is much lower than the first crime type. The violation just maintains a small portion of the whole area.
In Summary, the data visualization helps us better understand the crime destruction and pattern in New York city. Contrary to what we used to believe crime happens during the time, noon has the highest occurrence possibility. Besides Brooklyn has a high crime report rate, we should pay attention to the safety issue in this area.